Introduction

Vehicle Attributes and Effects on MPG

Motivation

The purpose of this study was to create a linear regression model based on vehicle data from the 1985 Ward’s Automotive Yearbook in hopes of uncovering possible relationships between MPG (miles per gallon) and other vehicle attributes.

By modeling city and highway MPG separately, I hoped to uncover possible differences in how the two rates are affected by the vehicle attributes given in the data.

The dataset in this study contains 205 observations, 6 of which were removed due to missing values, leaving a total of 199 observations. Additionally, the dataset originally contained 26 variables, but I chose to study only 15 of these, dropping variables with many missing values or little relevance to the research question.

highway city fueltype aspiration wheelbase length width height curbweight enginesize bore stroke compressionratio horsepower peakrpm
27 21 gas std 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000
27 21 gas std 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000
26 19 gas std 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154 5000
30 24 gas std 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102 5500
22 18 gas std 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115 5500
25 19 gas std 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110 5500
25 19 gas std 105.8 192.7 71.4 55.7 2844 136 3.19 3.40 8.5 110 5500
25 19 gas std 105.8 192.7 71.4 55.7 2954 136 3.19 3.40 8.5 110 5500
20 17 gas turbo 105.8 192.7 71.4 55.9 3086 131 3.13 3.40 8.3 140 5500
22 16 gas turbo 99.5 178.2 67.9 52.0 3053 131 3.13 3.40 7.0 160 5500

Variable Index

The following explanatory variables were the focus of our analysis:

  • Fuel Type: gas or diesel

  • Aspiration: standard (std) or turbo

  • Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels

  • Length: length (in.) of vehicle

  • Width: width (in.) of vehicle

  • Height: height (in.) of vehicle

  • Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled

  • Engine-size: the engine’s displacement (cubic in.), i.e., the volume of fuel and air swept by the pistons across all cylinders

  • Bore: diameter (in.) of each engine cylinder

  • Stroke: distance (in.) the piston travels within the cylinder

  • Compression-ratio: ratio of the cylinder’s largest volume to its smallest volume over a piston stroke

  • Horsepower: the power an engine produces (1 hp = 550 ft-lb per second)

  • Peak-RPM: the engine speed (revolutions per minute) at which peak power is produced

Response Variable EDA

Highway

ggplot(vehicles, aes(x = highway)) + geom_histogram(color = "white", fill = "darkred") + labs(x = "Highway MPG", y = "Count of Vehicles", title = "Distribution of Highway MPG") + theme_classic()

City

ggplot(vehicles, aes(x = city)) + geom_histogram(color = "white", fill = "blue") + labs(x = "City MPG", y = "Count of Vehicles", title = "Distribution of City MPG") + theme_classic()

Explanation

From these two histograms, we see that both highway and city MPG have roughly symmetric distributions. There is a slight right skew in each, but it is mild enough that a normal approximation remains reasonable. The two histograms are also similar in shape, each with three peaks near the center of the distribution.

Given these results, we see nothing that would prevent this data from being viable for linear regression.

Correlation Exploration

Highway

City

Explanation of Collinearity

From these two correlation plots, we see that wheelbase, length, width, curb-weight, engine size, bore, and horsepower all have strong negative correlations with both highway and city MPG. However, many of these explanatory variables have strong positive correlations with each other, which could signify the presence of collinearity.

Thus, we should move forward with LASSO (Least Absolute Shrinkage and Selection Operator) model selection, a commonly used remedy for regression models that exhibit collinearity.

Model Selection

Lambda Estimate for Highway Model

Reduced Highway MPG Model

14 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)       3.395918097
fueltype          .          
aspiration        .          
wheelbase         .          
length            .          
width             0.004117396
height            .          
curbweight       -0.190851601
enginesize       -0.019474681
bore              0.005619811
stroke            0.003983555
compressionratio  0.069151866
horsepower        .          
peakrpm          -0.012633210

Estimated Lambda for City Model

Reduced City MPG Model

14 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)       3.195646783
fueltype          .          
aspiration        .          
wheelbase         .          
length           -0.007060336
width             .          
height            .          
curbweight       -0.145091194
enginesize        .          
bore              .          
stroke            .          
compressionratio  0.080345176
horsepower       -0.074876843
peakrpm          -0.002452953

Explanation

After trying several transformations of the response variable, I found that a logarithmic transformation worked best for both highway and city MPG. This, along with LASSO model selection, led to relatively high R^2 values for both models.

For both models, the lambda value was chosen as the one with the smallest mean squared error averaged across the cross-validation folds (lambda.min from cv.glmnet).

Highway MPG -

  • Before the logarithmic transformation, the R^2 was 0.8583 for the training data and 0.7683 for the testing data.

  • After the logarithmic transformation, the R^2 increased to 0.8774 for the training data and 0.8460 for the testing data.

  • Because the predictors were standardized and the response is log(highway MPG), the curb-weight coefficient is interpreted as follows: as a vehicle’s curb weight increases by one standard deviation, log highway MPG decreases by 0.1909 on average, which corresponds to roughly a 17.4% decrease in highway MPG.

City MPG -

  • Before the logarithmic transformation, the R^2 was 0.8762 for the training data and 0.7696 for the testing data.

  • After the logarithmic transformation, the R^2 increased to 0.9013 for the training data and 0.8749 for the testing data.

  • Because the predictors were standardized and the response is log(city MPG), the curb-weight coefficient is interpreted as follows: as a vehicle’s curb weight increases by one standard deviation, log city MPG decreases by 0.1451 on average, which corresponds to roughly a 13.5% decrease in city MPG.

Model Assumptions

Subset EDA and Model Assumptions

Peak RPM

Horsepower

Compression Ratio

Curb Weight

Length

Stroke

Bore

Engine Size

Linearity

Normality

A-D Test and Conclusions


    Anderson-Darling normality test

data:  residuals1
A = 0.74013, p-value = 0.05262

    Anderson-Darling normality test

data:  residuals
A = 2.107, p-value = 2.214e-05

From the residual vs. fitted value plots, we see that the error terms stay relatively evenly spread around the horizontal zero line, rather than fanning out or showing any pattern that would imply non-constant variance. Thus, the assumption of constant variance is not violated for either city or highway MPG.

Regarding the normality plots in the second tab, we see different results for our highway and city MPG models. For the highway MPG plot, we see a relatively linear pattern, which provides evidence that the assumption of normality is not violated. However, in the city MPG plot, the points fan away from the normality line near theoretical quantiles of -1 and 1. Thus, there is evidence that the assumption of normality is violated for our city MPG model. These results are supported by the Anderson-Darling tests in this tab: the highway MPG model gives a p-value of 0.0526, narrowly above 0.05, while the city MPG model gives a p-value of 2.2e-05.

I attempted standardizing the dataset and various transformations of the response variable for my city MPG model, but none fixed the violation of the normality assumption.

Other Models

Highway Base Model


Call:
lm(formula = highway ~ . - city, data = vehicles)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1464 -1.6446 -0.0511  1.4150 11.1314 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      34.249759  15.839034   2.162  0.03187 *  
fueltypegas      17.106769   5.802061   2.948  0.00361 ** 
aspirationturbo  -1.617827   0.858188  -1.885  0.06098 .  
wheelbase         0.129459   0.086931   1.489  0.13813    
length           -0.146329   0.046231  -3.165  0.00181 ** 
width             0.093670   0.210913   0.444  0.65748    
height           -0.048190   0.122517  -0.393  0.69453    
curbweight       -0.007331   0.001401  -5.234 4.47e-07 ***
enginesize       -0.022690   0.015249  -1.488  0.13846    
bore             -0.528187   1.028506  -0.514  0.60818    
stroke            1.828172   0.744791   2.455  0.01503 *  
compressionratio  1.798445   0.399610   4.501 1.20e-05 ***
horsepower       -0.003549   0.015219  -0.233  0.81587    
peakrpm          -0.001914   0.000602  -3.180  0.00173 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.713 on 185 degrees of freedom
Multiple R-squared:  0.8556,    Adjusted R-squared:  0.8454 
F-statistic:  84.3 on 13 and 185 DF,  p-value: < 2.2e-16

City Base Model


Call:
lm(formula = highway ~ . - highway, data = vehicles)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1984 -0.5271  0.0999  0.8444  3.4824 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      14.1770035  8.0570108   1.760   0.0801 .  
city              0.9537563  0.0410704  23.222  < 2e-16 ***
fueltypegas      -2.4840675  3.0532233  -0.814   0.4169    
aspirationturbo  -1.9541253  0.4342658  -4.500 1.20e-05 ***
wheelbase        -0.0679359  0.0447789  -1.517   0.1309    
length            0.0379411  0.0246909   1.537   0.1261    
width             0.0565314  0.1066799   0.530   0.5968    
height           -0.0437759  0.0619626  -0.706   0.4808    
curbweight       -0.0019281  0.0007456  -2.586   0.0105 *  
enginesize       -0.0318438  0.0077221  -4.124 5.64e-05 ***
bore              0.1756890  0.5210439   0.337   0.7364    
stroke            0.7979347  0.3792776   2.104   0.0368 *  
compressionratio -0.0888076  0.2178282  -0.408   0.6840    
horsepower        0.0393347  0.0079155   4.969 1.53e-06 ***
peakrpm          -0.0006919  0.0003090  -2.239   0.0263 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.372 on 184 degrees of freedom
Multiple R-squared:  0.9633,    Adjusted R-squared:  0.9605 
F-statistic: 344.5 on 14 and 184 DF,  p-value: < 2.2e-16

Assumptions

Multiple Linear Regression Model Explanation and Conclusion

As seen in the model summaries to the left, fitting an ordinary multiple linear regression with all predictors also produces strong R^2 values: 0.8556 for highway MPG and 0.9633 under the City Base Model tab. The 0.9633 should be treated with caution, however: it came from a fit with highway MPG as the response and city MPG among the predictors, so it does not measure how well city MPG is explained. In addition, the collinearity among the predictors, as shown under Correlation Exploration, deters ordinary linear regression from being a viable way to model the data.

Additionally, when comparing the residual plots, we see that these models’ error terms do not stay as consistent around the zero line as the error terms from our log-LASSO models.

Project Conclusion

We succeeded in creating a model that explains 84.60% of the variability in log highway MPG on the held-out test data and does not violate our constant-variance and normality assumptions.

However, even though we found a model that explains 87.49% of the variability in log city MPG on the test data and does not violate our constant-variance assumption, the log-LASSO model for city MPG violated our normality assumption. This ultimately calls into question the validity of this model and its predictive ability.

---
title: "Final Project"
author: "Jesse Devitt"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: cosmo
      primary: "blue"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title {  /* chart_title  */
   font-size: 20px;
  }
body{  /* Normal  */
      font-size: 18px;
  }
</style>

```{r setup, include=FALSE}
library(flexdashboard)
library(shiny)
library(shinydashboard)
```

Introduction
===
<head>
    <base target = "_blank">
</head>

<font size=5>
**Vehicle Attributes and Effects on MPG**
</font>


Column {data-width=650}
-----------------------------------------------------------------------

### Motivation

The purpose of this study was to create a linear regression model based on vehicle data from the 1985 Ward's Automotive Yearbook in hopes of uncovering possible relationships between MPG (miles per gallon) and other vehicle attributes.

By modeling city and highway MPG separately, I hoped to uncover possible differences in how the two rates are affected by the vehicle attributes given in the data.

The dataset in this study contains 205 observations, 6 of which were removed due to missing values, leaving a total of 199 observations.
Additionally, the dataset originally contained 26 variables, but I chose to study only 15 of these, dropping variables with many missing values or little relevance to the research question.

```{r}
knitr::opts_chunk$set(echo = TRUE)
library(pacman)
library(tidyverse)
library(plotly)
library(corrplot)
library(RColorBrewer)
library(stats)
vehicles <- read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/MTH 369/automobile/imports-85.data", header=FALSE)
vehicles <- vehicles %>% dplyr::select(-c(V1, V2, V3, V6, V7, V8, V9, V15, V16, V18, V26))
names(vehicles) <- c("fueltype", "aspiration", "wheelbase", "length", "width", "height", "curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm", "city", "highway")
# Missing values are coded as "?" in the raw file; recode them to NA first
# so the numeric conversion below does not rely on coercion warnings.
for (col in c("bore", "stroke", "horsepower", "peakrpm")) {
  vehicles[[col]][vehicles[[col]] == "?"] <- NA
}
# Convert the measurement columns to numeric
num_cols <- c("city", "highway", "curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm")
vehicles[num_cols] <- lapply(vehicles[num_cols], as.numeric)
# Treat the two categorical predictors as factors
vehicles$fueltype <- as.factor(vehicles$fueltype)
vehicles$aspiration <- as.factor(vehicles$aspiration)
vehicles <- vehicles[complete.cases(vehicles),]
vehicles <- vehicles[, c("city", names(vehicles)[-which(names(vehicles) == "city")])]
vehicles <- vehicles[, c("highway", names(vehicles)[-which(names(vehicles) == "highway")])]
standardized <- apply(vehicles[, 5:15], 2, function(x) (x-mean(x)) / sd(x))
v <- vehicles %>% dplyr::select(highway, city, fueltype, aspiration)
stan_vehicles <- cbind.data.frame(v, standardized)
knitr::kable(vehicles[1:10,])
```
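As a small illustration of the cleaning step described above, the sketch below reads a few rows in which "?" marks a missing value and then keeps only the complete cases. The inline rows and the four column names are invented stand-ins, not records from the dataset. Passing `na.strings = "?"` to `read.csv` is one possible way to handle the "?" coding and is shown as an alternative to recoding the values after the file is read.

```r
# Toy rows standing in for the raw file; the values are invented for illustration.
raw <- "gas,std,3.47,111
gas,turbo,?,140
diesel,std,3.19,?
gas,std,3.40,102"

# na.strings = "?" turns the missing-value marker into NA on read,
# so the affected columns are parsed as numeric directly.
toy <- read.csv(text = raw, header = FALSE, na.strings = "?",
                col.names = c("fueltype", "aspiration", "bore", "horsepower"))

# Keep only rows with no missing values, as in the analysis.
toy_complete <- toy[complete.cases(toy), ]

nrow(toy)           # rows before dropping incomplete cases
nrow(toy_complete)  # rows after
str(toy_complete)
```

With this approach the numeric columns need no separate "?" handling, at the cost of treating every "?" in the file as missing.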

Column {data-width=350}
-----------------------------------------------------------------------

### Variable Index

The following explanatory variables were the focus of our analysis:

- Fuel Type: gas or diesel

- Aspiration: standard (std) or turbo

- Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels

- Length: length (in.) of vehicle

- Width: width (in.) of vehicle

- Height: height (in.) of vehicle

- Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled 

- Engine-size: the engine's displacement (cubic in.), i.e., the volume of fuel and air swept by the pistons across all cylinders

- Bore: diameter (in.) of each engine cylinder

- Stroke: distance (in.) the piston travels within the cylinder

- Compression-ratio: ratio of the cylinder's largest volume to its smallest volume over a piston stroke

- Horsepower: the power an engine produces (1 hp = 550 ft-lb per second)

- Peak-RPM: the engine speed (revolutions per minute) at which peak power is produced

Response Variable EDA
===

Column {.tabset data-width=650}
---

### Highway

```{r}
ggplot(vehicles, aes(x = highway)) + geom_histogram(color = "white", fill = "darkred") + labs(x = "Highway MPG", y = "Count of Vehicles", title = "Distribution of Highway MPG") + theme_classic()
```

### City

```{r}
ggplot(vehicles, aes(x = city)) + geom_histogram(color = "white", fill = "blue") + labs(x = "City MPG", y = "Count of Vehicles", title = "Distribution of City MPG") + theme_classic()
```

Column {data-width=350}
---

### Explanation

From these two histograms, we see that both highway and city MPG have roughly symmetric distributions. There is a slight right skew in each, but it is mild enough that a normal approximation remains reasonable. The two histograms are also similar in shape, each with three peaks near the center of the distribution.

Given these results, we see nothing that would prevent this data from being viable for linear regression.

Correlation Exploration
===

Column {.tabset data-width=650}
---

### Highway

```{r, echo=FALSE}
highwaynumeric <- vehicles %>% select(-c(fueltype, aspiration, city))
m1 <- round(cor(highwaynumeric), 2)
corrplot(m1, method = c("number"),type="upper",main="Highway MPG",mar=c(0,0,1,0), number.cex = 0.5)
```

### City

```{r, echo=FALSE}
citynumeric <- vehicles %>% select(-c(fueltype, aspiration, highway))
m <- round(cor(citynumeric), 2)
corrplot(m, method = c("number"),type="upper",main="City MPG",mar=c(0,0,1,0), number.cex = 0.5)
```

Column {data-width=350}
---

### Explanation of Collinearity

From these two correlation plots, we see that wheelbase, length, width, curb-weight, engine size, bore, and horsepower all have strong negative correlations with both highway and city MPG. However, many of these explanatory variables have strong positive correlations with each other, which could signify the presence of collinearity.

Thus, we should move forward with LASSO (Least Absolute Shrinkage and Selection Operator) model selection, a commonly used remedy for regression models that exhibit collinearity.
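Pairwise correlations only hint at collinearity. One common way to quantify it is the variance inflation factor (VIF), which was not part of the original analysis. The sketch below computes VIFs in base R from the inverse of the predictor correlation matrix. It uses the built-in `mtcars` data purely as a stand-in, and the choice of columns is an assumption made for illustration.

```r
# mtcars ships with R; it is used here only as a stand-in dataset.
X <- mtcars[, c("wt", "disp", "hp", "drat")]

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif <- setNames(diag(solve(cor(X))), colnames(X))
round(vif, 2)

# Cross-check one value against the definition VIF = 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on the others.
r2_wt <- summary(lm(wt ~ disp + hp + drat, data = mtcars))$r.squared
all.equal(unname(vif["wt"]), 1 / (1 - r2_wt))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign; the same two lines could be applied to the numeric predictors in `vehicles`.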

Model Selection
===

Column {data-width=300}
---

### Lambda Estimate for Highway Model

```{r, fig.align='center', echo=FALSE}
x<-as.matrix(stan_vehicles[,3:15])
y1<-log(vehicles$highway)

set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))

y1.train<-y1[train]
y1.test<-y1[-train]

x.train<-x[train,]
x.test<-x[-train,]

library(glmnet)
set.seed(2000)
cv.lasso1<-cv.glmnet(x.train, y1.train, alpha = 1)
#cv.lasso1$lambda.min
plot(cv.lasso1)
```

### Reduced Highway MPG Model

```{r, fig.align='center', echo=FALSE}
model1<-glmnet(x.train, y1.train, alpha = 1, lambda = cv.lasso1$lambda.min)
coef1<-coef(model1)

#to compute training SSE from LASSO regression
y_predictedtrain1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.train)
SSEtrain1<-sum((y_predictedtrain1-y1.train)^2)
residuals1 <- y1.train - y_predictedtrain1 # residual = observed - fitted

#Computing R-squared
SSTOtrain1<-sum((y1.train-mean(y1.train))^2)
R2train1<-1-SSEtrain1/SSTOtrain1

#to compute testing SSE from LASSO regression
y_predictedtest1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.test)
SSEtest1<-sum((y_predictedtest1-y1.test)^2)

#Computing R-squared
SSTOtest1<-sum((y1.test-mean(y1.test))^2)
R2test1<-1-SSEtest1/SSTOtest1

print(coef1)
```

Column {data-width=300}
---

### Estimated Lambda for City Model

```{r, fig.align='center', echo=FALSE}
library(MASS)

#bc<-boxcox(city~peakrpm+horsepower+compressionratio+curbweight+length, data = vehicles)
#lambda<-bc$x[which.max(bc$y)]

x<-as.matrix(stan_vehicles[,3:15])
y<-log(vehicles$city)

set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))

y.train<-y[train]
y.test<-y[-train]

x.train<-x[train,]
x.test<-x[-train,]

library(glmnet)
set.seed(2000)
cv.lasso<-cv.glmnet(x.train, y.train, alpha = 1)
#cv.lasso$lambda.min
plot(cv.lasso)
```

### Reduced City MPG Model

```{r, fig.align='center', echo=FALSE}
model<-glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)
coef<-coef(model)

#to compute training SSE from LASSO regression
y_predictedtrain <- predict(model, s = cv.lasso$lambda.min, newx = x.train)
SSEtrain<-sum((y_predictedtrain-y.train)^2)
residuals<-y.train - y_predictedtrain # residual = observed - fitted

#Computing R-squared
SSTOtrain<-sum((y.train-mean(y.train))^2)
R2train<-1-SSEtrain/SSTOtrain

#to compute testing SSE from LASSO regression
y_predictedtest <- predict(model, s = cv.lasso$lambda.min, newx = x.test)
SSEtest<-sum((y_predictedtest-y.test)^2)

#Computing R-squared
SSTOtest<-sum((y.test-mean(y.test))^2)
R2test<-1-SSEtest/SSTOtest

print(coef)
```

Column {data-width=400}
---

### Explanation

After trying several transformations of the response variable, I found that a logarithmic transformation worked best for both highway and city MPG. This, along with LASSO model selection, led to relatively high R^2 values for both models.

For both models, the lambda value was chosen as the one with the smallest mean squared error averaged across the cross-validation folds (`lambda.min` from `cv.glmnet`).
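To make the lambda choice concrete, here is a minimal sketch of cross-validated LASSO with `glmnet` on simulated data. The simulated `x` and `y`, the seed, and the object name `cv_fit` are assumptions for illustration and are not the study data. It also shows `lambda.1se`, a more conservative alternative that this analysis did not use.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)         # simulated standardized predictors
y <- 2 * x[, 1] - x[, 2] + rnorm(n)     # only the first two columns matter

cv_fit <- cv.glmnet(x, y, alpha = 1)    # alpha = 1 requests the LASSO penalty

# lambda.min minimizes the mean cross-validated error (MSE by default here);
# lambda.1se is the largest lambda within one standard error of that minimum.
cv_fit$lambda.min
cv_fit$lambda.1se

# Coefficients at lambda.min; entries printed as "." were shrunk to zero.
coef(cv_fit, s = "lambda.min")
```

Choosing `lambda.1se` instead would typically give a sparser model at a small cost in cross-validated error.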

Highway MPG - 

 - Before the logarithmic transformation, the R^2 was 0.8583 for the training data and 0.7683 for the testing data.
 
 - After the logarithmic transformation, the R^2 increased to 0.8774 for the training data and 0.8460 for the testing data.
 
 - Because the predictors were standardized and the response is log(highway MPG), the curb-weight coefficient is interpreted as follows: as a vehicle's curb weight increases by one standard deviation, log highway MPG decreases by 0.1909 on average, which corresponds to roughly a 17.4% decrease in highway MPG.
 

City MPG - 

 - Before the logarithmic transformation, the R^2 was 0.8762 for the training data and 0.7696 for the testing data.

 - After the logarithmic transformation, the R^2 increased to 0.9013 for the training data and 0.8749 for the testing data.
 
 - Because the predictors were standardized and the response is log(city MPG), the curb-weight coefficient is interpreted as follows: as a vehicle's curb weight increases by one standard deviation, log city MPG decreases by 0.1451 on average, which corresponds to roughly a 13.5% decrease in city MPG.
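Since both models were fit to log(MPG) with standardized predictors, a coefficient describes a change on the log scale per one standard deviation of the predictor. The sketch below converts the two curb-weight coefficients printed in the reduced models into approximate percent changes in MPG. The names `beta` and `pct_change` are introduced here for illustration only.

```r
# Curb-weight coefficients printed in the two reduced models (log-MPG scale,
# per one standard deviation of curb weight).
beta <- c(highway = -0.190851601, city = -0.145091194)

# A change of beta in log(MPG) multiplies MPG by exp(beta),
# so the percent change in MPG is 100 * (exp(beta) - 1).
pct_change <- 100 * (exp(beta) - 1)
round(pct_change, 1)
```

This gives a decrease of roughly 17.4% in highway MPG and 13.5% in city MPG per standard deviation of curb weight.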

Model Assumptions
===

**Subset EDA and Model Assumptions**

Column {.tabset data-width=400}
---

### Peak RPM
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = peakrpm)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + labs(x = "Peak RPM (revolutions/minute)", y = "MPG", title = "Relationship Between MPG and Corresponding Peak RPM") + theme_classic()
```

### Horsepower
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = horsepower)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + labs(x = "Horsepower (hp)", y = "MPG", title = "Relationship Between MPG and Corresponding Horsepower") + theme_classic()
```

### Compression Ratio
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = compressionratio)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + labs(x = "Compression Ratio", y = "MPG", title = "Relationship Between MPG and Corresponding Compression Ratio") + theme_classic()
```

### Curb Weight
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = curbweight)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + labs(x = "Curb Weight (lbs)", y = "MPG", title = "Relationship Between MPG and Corresponding Curb Weight") + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + theme_classic()
```

### Length
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = length)) + geom_point(aes(y = city, color = "City MPG"), size = 1) +  scale_color_manual(values = c("City MPG" = "blue")) + labs(x = "Length (in)", y = "City MPG", title = "Relationship Between City MPG and Corresponding Length") + theme_classic() + theme(legend.position = "none")
```

### Stroke
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = stroke)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "darkred")) + labs(x = "Stroke (in)", y = "Highway MPG", title = "Relationship Between Highway MPG and Corresponding Stroke") + theme_classic() + theme(legend.position = "none")
```

### Bore
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = bore)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "darkred")) + labs(x = "Bore (in)", y = "Highway MPG", title = "Relationship Between Highway MPG and Corresponding Bore") + theme_classic() + theme(legend.position = "none")
```

### Engine Size
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = enginesize)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "darkred")) + labs(x = "Engine Size (cubic in)", y = "Highway MPG", title = "Relationship Between Highway MPG and Corresponding Engine Size") + theme_classic() + theme(legend.position = "none")
```


Column {.tabset data-width=600}
---

### Linearity

```{r, fig.align='center', echo=FALSE, out.width="50%"}
plot(residuals1~y_predictedtrain1, xlab = "Fitted Values", ylab = "Residuals", main = "Highway MPG", col = "darkred")
abline(h=0)

plot(residuals~y_predictedtrain, xlab = "Fitted Values", ylab = "Residuals", main = "City MPG", col = "blue")
abline(h=0)
```

### Normality

```{r, fig.align='center', echo=FALSE, out.width="50%"}
library(nortest)
qqnorm(residuals1, main = "Normal Q-Q Plot of Highway MPG")
qqline(residuals1, col = "darkred")

qqnorm(residuals, main = "Normal Q-Q Plot of City MPG")
qqline(residuals, col = "blue")
```

### A-D Test and Conclusions
```{r, fig.align='center', echo=FALSE, out.width="50%"}
ad.test(residuals1) #highway

ad.test(residuals) #city
```

From the residual vs. fitted value plots, we see that the error terms stay relatively evenly spread around the horizontal zero line, rather than fanning out or showing any pattern that would imply non-constant variance. Thus, the assumption of constant variance is not violated for either city or highway MPG.

Regarding the normality plots in the second tab, we see different results for our highway and city MPG models. For the highway MPG plot, we see a relatively linear pattern, which provides evidence that the assumption of normality is not violated. However, in the city MPG plot, the points fan away from the normality line near theoretical quantiles of -1 and 1. Thus, there is evidence that the assumption of normality is violated for our city MPG model. These results are supported by the Anderson-Darling tests in this tab: the highway MPG model gives a p-value of 0.0526, narrowly above 0.05, while the city MPG model gives a p-value of 2.2e-05.

I attempted standardizing the dataset and various transformations of the response variable for my city MPG model, but none fixed the violation of the normality assumption.
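As a cross-check on a normality conclusion, a second test can be run on the same residuals. The sketch below uses base R's `shapiro.test` (Shapiro-Wilk, a different test from the Anderson-Darling test used in this tab) on simulated residuals, since the fitted models are not reproduced here. The names `res_normal` and `res_skewed` and the sample size are hypothetical.

```r
set.seed(42)
res_normal <- rnorm(139)          # simulated residuals that are truly normal
res_skewed <- rexp(139) - 1       # simulated residuals with a strong right skew

# Shapiro-Wilk tests the null hypothesis that the sample is normally distributed.
p_normal <- shapiro.test(res_normal)$p.value
p_skewed <- shapiro.test(res_skewed)$p.value

c(normal = p_normal, skewed = p_skewed)
```

A small p-value for the skewed sample indicates a departure from normality. Applying both tests to the model residuals would show whether the two agree.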

Other Models
===

Column {.tabset data-width=600}
---

### Highway Base Model

```{r, fig.align='center', echo=FALSE}
base_highway <- lm(highway~.-city, data = vehicles)
summary(base_highway)
```

### City Base Model

```{r, fig.align='center', echo=FALSE}
# City MPG is the response; highway MPG is excluded from the predictors
base_city <- lm(city~.-highway, data = vehicles)
summary(base_city)
```

### Assumptions

```{r, fig.align='center', echo=FALSE, out.width="50%"}
plot(base_highway$fitted.values, base_highway$residuals, col = "darkred", xlab = "Fitted Values", ylab = "Residuals", main = "Highway MPG")
abline(h=0)

plot(base_city$fitted.values, base_city$residuals, col = "blue", xlab = "Fitted Values", ylab = "Residuals", main = "City MPG")
abline(h=0)
```

Column {data-width=400}
---

**Multiple Linear Regression Model Explanation and Conclusion**

As seen in the model summaries to the left, fitting an ordinary multiple linear regression with all predictors also produces strong R^2 values: 0.8556 for highway MPG and 0.9633 reported for the city model. The 0.9633 should be treated with caution, however: it came from a fit with highway MPG as the response and city MPG among the predictors, so it does not measure how well city MPG is explained. In addition, the collinearity among the predictors, as shown on the Correlation Exploration page, deters ordinary linear regression from being a viable way to model the data.

Additionally, when comparing the residual plots, we see that these models' error terms do not stay as consistent around the zero line as the error terms from our log-LASSO models.

**Project Conclusion**

We succeeded in creating a model that explains 84.60% of the variability in log highway MPG on the held-out test data and does not violate our constant-variance and normality assumptions. 

However, even though we found a model that explains 87.49% of the variability in log city MPG on the test data and does not violate our constant-variance assumption, the log-LASSO model for city MPG violated our normality assumption. This ultimately calls into question the validity of this model and its predictive ability.
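The percentages quoted above are R^2 values on the log-MPG scale, computed in the Model Selection chunks as 1 - SSE/SSTO. A minimal sketch of that calculation as a reusable function follows. The function name `r_squared` and the sample values are made up for illustration.

```r
# R^2 = 1 - SSE / SSTO, computed on whatever data y and y_hat come from.
r_squared <- function(y, y_hat) {
  sse  <- sum((y - y_hat)^2)        # unexplained variation
  ssto <- sum((y - mean(y))^2)      # total variation around the mean
  1 - sse / ssto
}

y <- c(3.2, 3.4, 3.1, 3.6, 3.3)     # made-up log-MPG values
r_squared(y, y)                     # perfect predictions
r_squared(y, rep(mean(y), 5))       # predicting the mean everywhere
```

Perfect predictions give an R^2 of 1 and predicting the mean gives 0. On held-out data the value can also be negative when a model predicts worse than the test-set mean.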